Machine Learning: Mathematical Theory and Applications

Computer lab 1

Hugo Xinghe Chen

Published

September 6, 2024

Warning

The instructions in this lab assume that the following packages are installed:

  • splines

  • caret

Packages may be installed by executing install.packages('packagename'), where 'packagename' is the name of the package, e.g. 'splines'. You may have to install additional packages to solve the computer lab. If you want this file to run locally on your computer, you have to change some paths to where the data are stored (see below).
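If you want to guard against missing packages automatically, a small helper along these lines can be run at the top of your Quarto document (the function name install_if_missing is our own illustration, not part of the lab):

```r
# Illustrative helper: installs any packages from `pkgs` that are not
# already available, and invisibly returns the ones that were missing.
install_if_missing <- function(pkgs) {
  missing <- pkgs[!pkgs %in% rownames(installed.packages())]
  if (length(missing) > 0) install.packages(missing)
  invisible(missing)
}

# splines ships with R, so nothing is installed here
install_if_missing("splines")
```

For this lab you would call it with c("splines", "caret").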

Introduction

This computer lab treats topics such as polynomial and spline regression, regularisation, cross-validation, regression and decision trees, and classification methods.


Instructions

In this computer lab you will work individually, but you are free to discuss with other students, especially the student you will work with on Computer labs 2 and 4! However, you have to hand in your own set of solutions. It is strictly forbidden to copy each other's code or solutions, or to use AI tools (such as ChatGPT). Not following these instructions is regarded as academic misconduct and may have serious consequences for you.

This computer lab is worth a total of 15 marks (the subject has a maximum of 100 marks). The maximum marks for a given problem are shown in parentheses at the top of the section. The problems you should solve are marked with the symbol 💪 and surrounded by a box. Hand in the solutions and programming code in the form of an html document generated by Quarto. Before submitting, carefully check that the document compiles without any errors. Use properly formatted figures and programming code.

Warning

Not submitting according to the instructions may result in loss of marks. The same applies to poorly formatted reports (including poorly formatted code/figures).

1. Polynomial regression for bike rental data (2 marks)

In this problem, we consider modelling the number of rides each hour for a bike rental company. The raw data are stored in the file bike_rental_hourly.csv, which can be downloaded from the Canvas page of the course.

The dataset contains 17379 hourly observations over the two-year period January 1, 2011 - December 31, 2012. It contains many features that may be useful for predicting the number of rides, such as the hour of the day, temperature, humidity, season, weekday, etc. In this section, we consider modelling the logarithm of the number of rides as a function of the time of day (scaled to the interval [0,1], see below). The following code reads in the data (note that you have to change the path to where you stored the file!) and creates the variables of interest (log_cnt and hour).

rm(list=ls()) # Clear all variables from the workspace
cat("\014") # Clear the console
bike_data <- read.csv('/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/1/bike_rental_hourly.csv')
str(bike_data)
'data.frame':   17379 obs. of  17 variables:
 $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ dteday    : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
 $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
 $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
 $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
 $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
 $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
head(bike_data)
  instant     dteday season yr mnth hr holiday weekday workingday weathersit
1       1 2011-01-01      1  0    1  0       0       6          0          1
2       2 2011-01-01      1  0    1  1       0       6          0          1
3       3 2011-01-01      1  0    1  2       0       6          0          1
4       4 2011-01-01      1  0    1  3       0       6          0          1
5       5 2011-01-01      1  0    1  4       0       6          0          1
6       6 2011-01-01      1  0    1  5       0       6          0          2
  temp  atemp  hum windspeed casual registered cnt
1 0.24 0.2879 0.81    0.0000      3         13  16
2 0.22 0.2727 0.80    0.0000      8         32  40
3 0.22 0.2727 0.80    0.0000      5         27  32
4 0.24 0.2879 0.75    0.0000      3         10  13
5 0.24 0.2879 0.75    0.0000      0          1   1
6 0.24 0.2576 0.75    0.0896      0          1   1
bike_data$log_cnt <- log(bike_data$cnt)
bike_data$hour <- bike_data$hr/23 # transform [0, 23] to [0, 1]. 0 is midnight, 1 is 11 PM
plot(log_cnt ~ hour, data = bike_data, col = "cornflowerblue")

The number of rides seems to peak in the morning (8 AM, \(8/23 \approx 0.35\)) and after work (6 PM, \(18/23 \approx 0.78\)).

We start by fitting a polynomial of order 2 to a training dataset consisting of Feb 1, 2011 - March 31, 2011. We use the following two months (April 1, 2011 - May 31, 2011) as test dataset to evaluate the predictions. The following code creates the training and test datasets.

bike_data_train <- bike_data[bike_data$dteday >= as.Date("2011-02-01") & bike_data$dteday <=  as.Date("2011-03-31"), ]
bike_data_test <- bike_data[bike_data$dteday >= as.Date("2011-04-01") & bike_data$dteday <=  as.Date("2011-05-31"), ]

The following code estimates the model (a polynomial of order 2) using least squares and computes the root mean squared error (RMSE) for both the training data and the test data. Moreover, the code plots the training data, test data and the fitted polynomial in the same plot.

y_train <- bike_data_train$log_cnt
y_test <- bike_data_test$log_cnt
p <- 2 # Order of polynomial
X_train <- cbind(1, poly(bike_data_train$hour, p, raw = TRUE, simple = TRUE))

# Design matrix / matrix of features (including intercept)
beta_hat <- solve(t(X_train)%*%X_train)%*%t(X_train)%*%y_train

# Predict in-sample and compute RMSE
y_hat_train <- X_train%*%beta_hat 
RMSE_train <- sqrt(sum((y_train - y_hat_train)^2)/length(y_train))

# Predict out-of-sample and compute RMSE
X_test <- cbind(1, poly(bike_data_test$hour, p, raw = TRUE, simple = TRUE))
y_hat_test <- X_test%*%beta_hat
RMSE_test <- sqrt(sum((y_test - y_hat_test)^2)/length(y_test))

# Plot training data, test data, and fit on a fine grid.
plot(log_cnt ~ hour, data = bike_data_train, col = "cornflowerblue", ylim = c(0, 8))
lines(bike_data_test$hour, bike_data_test$log_cnt, type = "p", col = "lightcoral")
hours_grid <- seq(0, 1, length.out = 1000)
X_grid <- cbind(1, poly(hours_grid, p, raw = TRUE, simple = TRUE))
y_hat_grid <- X_grid%*%beta_hat
lines(hours_grid, y_hat_grid, lty = 1, col = "lightcoral")
legend(x = "topleft", pch = c(1, 1, NA), lty = c(NA, NA, 1), col = c("cornflowerblue", "lightcoral",  "lightcoral"), legend=c("Train", "Test", "Fitted curve"))

💪 Problem 1.1

Fit a polynomial of order 8 using least squares, without using R functions such as lm(). Plot the training data, test data and the fitted polynomial in the same plot.

# Generate polynomial features up to order 8 for training data
p <- 8
X_train1 <- cbind(1, poly(bike_data_train$hour, p, raw = TRUE, simple = TRUE))
y_train1 <- bike_data_train$log_cnt

# Calculate the coefficients of the polynomial using least squares
beta_hat1 <- solve(t(X_train1) %*% X_train1) %*% t(X_train1) %*% y_train1

X_test1 <- cbind(1, poly(bike_data_test$hour, p, raw = TRUE, simple = TRUE))
y_test1 <- bike_data_test$log_cnt

# Predictions
y_hat_train1 <- X_train1 %*% beta_hat1
y_hat_test1 <- X_test1 %*% beta_hat1

RMSE_train1 <- sqrt(sum((y_train1 - y_hat_train1)^2) / length(y_train1))
RMSE_test1 <- sqrt(sum((y_test1 - y_hat_test1)^2) / length(y_test1))

plot(log_cnt ~ hour, data = bike_data_train, col = "cornflowerblue", ylim = c(0, 8))

points(bike_data_test$hour, bike_data_test$log_cnt, col = "lightcoral")
hours_grid1 <- seq(0, 1, length.out = 1000)
X_grid1 <- cbind(1, poly(hours_grid1, p, raw = TRUE, simple = TRUE))
y_hat_grid1 <- X_grid1 %*% beta_hat1
lines(hours_grid1, y_hat_grid1, lty = 1, col = "green",lwd = 2)
legend(x = "topleft", pch = c(1, 1, NA), lty = c(NA, NA, 1), col = c("cornflowerblue", "lightcoral", "green"), legend = c("Train", "Test", "Fitted curve"))

💪 Problem 1.2

Fit polynomials of varying order 1-10 using least squares, without using R functions such as lm(). Compute the RMSEs for the training and test data and plot them on the same figure as a function of the polynomial order. Are you underfitting or overfitting the data?

RMSE_train <- numeric(10)
RMSE_test <- numeric(10)

y_train <- bike_data_train$log_cnt
y_test <- bike_data_test$log_cnt

# Polynomial regression for orders 1 to 10
for (p in 1:10) {
    X_train <- cbind(1, poly(bike_data_train$hour, p, raw = TRUE, simple = TRUE))
    
    beta_hat <- solve(t(X_train) %*% X_train) %*% t(X_train) %*% y_train

    X_test <- cbind(1, poly(bike_data_test$hour, p, raw = TRUE, simple = TRUE))

    y_hat_train <- X_train %*% beta_hat
    y_hat_test <- X_test %*% beta_hat

    RMSE_train[p] <- sqrt(sum((y_train - y_hat_train)^2) / length(y_train))
    RMSE_test[p] <- sqrt(sum((y_test - y_hat_test)^2) / length(y_test))
}

plot(1:10, RMSE_train, type = "b", col = "blue", pch = 16, xlab = "Polynomial Order", ylab = "RMSE", main = "RMSE vs. Polynomial Order", ylim = c(0, max(c(RMSE_train, RMSE_test))))

points(1:10, RMSE_test, type = "b", col = "red", pch = 16)
legend("topright", legend = c("Training Data", "Test Data"), col = c("blue", "red"), pch = 16, bg = "white")

Since there is a significant gap between the training and test RMSEs, the model is overfitting the data.

💪 Problem 1.3

Polynomials are global functions and their fit may be sensitive to outliers. Local fitting methods can sometimes be more robust. One such method is the loess method, which is a nonparametric method that fits locally weighted regressions to subsets of points and subsequently combines them to obtain a global fit. Use the loess function in R with the standard settings to fit a locally weighted regression to the training data. Is this method better than that in Problem 1.1? Plot the training data, test data and both fitted curves in the same plot.

Tip

The predict() function can be used on the object returned by loess().

# Fit regression using loess
loess_fit <- loess(log_cnt ~ hour, data = bike_data_train)

y_hat_train_loess <- predict(loess_fit, newdata = bike_data_train)
y_hat_test_loess <- predict(loess_fit, newdata = bike_data_test)

RMSE_train_loess <- sqrt(mean((y_train1 - y_hat_train_loess)^2))
RMSE_test_loess <- sqrt(mean((y_test1 - y_hat_test_loess)^2))

plot(log_cnt ~ hour, data = bike_data_train, col = "cornflowerblue", ylim = c(0, 8))

points(bike_data_test$hour, bike_data_test$log_cnt, col = "lightcoral")
#Problem1.1
lines(hours_grid1, y_hat_grid1, lty = 1, col = "green",lwd = 2)
# Loess fit
lines(bike_data_train$hour, y_hat_train_loess, col = "purple", lwd = 2, lty = 2)  

legend(x = "topleft", pch = c(1, 1, NA, NA), lty = c(NA, NA, 1, 2), col = c("cornflowerblue", "lightcoral", "green", "purple"), legend = c("Train", "Test", "Polynomial Fit", "Loess Fit"))

cat("RMSE for Polynomial Regression (Train): ", RMSE_train1, "\n")
RMSE for Polynomial Regression (Train):  0.7232024 
cat("RMSE for Polynomial Regression (Test): ", RMSE_test1, "\n")
RMSE for Polynomial Regression (Test):  1.021449 
cat("RMSE for Loess Fit (Train): ", RMSE_train_loess, "\n")
RMSE for Loess Fit (Train):  0.8989078 
cat("RMSE for Loess Fit (Test): ", RMSE_test_loess, "\n")
RMSE for Loess Fit (Test):  1.120946 
Since the RMSEs of the loess fit are higher than those of the polynomial regression, the loess method is not better than the one in Problem 1.1.

2. Regularised spline regression for bike rental data (3 marks)

Using the same training and test datasets as above, we will fit spline regressions with L1 and L2 regularisations (also known as penalties).

We create a natural cubic spline basis function for the variable hour with 25 equally spaced knots between 0.05 and 0.95. We first fit a regression using least squares without regularisation. The following code does this and plots the spline fit together with the training and test data. Note that we are not adding an intercept to the design matrix, since this is taken care of by the argument intercept=TRUE.

suppressMessages(library(splines))
knots <- seq(0.05, 0.95, length.out = 25)
X_train <- ns(bike_data_train$hour, knots = knots, intercept = TRUE)
X_test <- ns(bike_data_test$hour, knots = knots, intercept = TRUE)
beta_hat <- solve(t(X_train)%*%X_train)%*%t(X_train)%*%y_train

# Plot training data, test data, and spline fit on a fine grid.
plot(log_cnt ~ hour, data = bike_data_train, col = "cornflowerblue", ylim = c(0, 8))

lines(bike_data_test$hour, bike_data_test$log_cnt, type = "p", col = "lightcoral")

hours_grid <- seq(0, 1, length.out = 1000)
X_grid <- ns(hours_grid, knots = knots, intercept = TRUE) # cbind(1, ns(hours_grid, knots = knots))

y_hat_spline_grid <- X_grid%*%beta_hat
lines(hours_grid, y_hat_spline_grid, lty = 1, col = "green", lwd = 1.5)
legend(x = "topleft", pch = c(1, 1, NA), lty = c(NA, NA, 1), col = c("cornflowerblue", "lightcoral", "green"), legend=c("Train", "Test", "Spline"))

We have deliberately chosen too many knots and no regularisation so that the fit has a large variance, which is clearly evident from the figure above. Your task in the next problem is to fit a regularised spline regression.

💪 Problem 2.1

Fit a spline regression to the training data with an L2 regularisation using a suitable value of \(\lambda\), without using R functions such as glmnet(). Plot the fit together with the training and test data.

Tip

The least squares estimator when using an L2 penalty is known as the Ridge regression estimator. Moreover, to determine a suitable value of \(\lambda\), one approach is to fit several models (using different \(\lambda\) values, e.g. lambda_grid <- seq(0, 1, length.out=100)) and choose the value of \(\lambda\) that minimises the root mean squared error of the test data.
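As a small sketch of this tip on simulated toy data (not the bike dataset), the Ridge estimator \(\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y\) can be combined with a grid search over \(\lambda\) that scores each fit on held-out data:

```r
# Toy sketch: ridge regression with lambda chosen by held-out RMSE.
# The data below are simulated; in the lab you would use the bike data.
set.seed(1)
x <- runif(80)
y <- sin(2 * pi * x) + rnorm(80, sd = 0.3)
train <- 1:60
test <- 61:80

X <- cbind(1, poly(x, 5, raw = TRUE, simple = TRUE)) # polynomial basis
I <- diag(ncol(X))

lambda_grid <- seq(0, 1, length.out = 100)
rmse_test <- sapply(lambda_grid, function(lambda) {
  # Ridge closed form: (X'X + lambda I)^{-1} X'y, fitted on training rows only
  beta <- solve(t(X[train, ]) %*% X[train, ] + lambda * I) %*% t(X[train, ]) %*% y[train]
  sqrt(mean((y[test] - X[test, ] %*% beta)^2))
})
best_lambda <- lambda_grid[which.min(rmse_test)]
```

Note that scoring on the training data instead would always favour \(\lambda = 0\); the held-out evaluation is what makes the grid search meaningful.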

library(splines)

# Define the knots for the spline
knots2 <- seq(0.05, 0.95, length.out = 25)

# Generate natural cubic spline basis functions
X_train2 <- ns(bike_data_train$hour, knots = knots2, intercept = TRUE)
X_test2 <- ns(bike_data_test$hour, knots = knots2, intercept = TRUE)
y_train2 <- bike_data_train$log_cnt
y_test2 <- bike_data_test$log_cnt

lambda_grid <- seq(0, 1, length.out = 100)
min_rmse <- Inf
best_lambda <- NULL

n <- nrow(X_train2)
p <- ncol(X_train2)
I <- diag(p)
  
for (lambda in lambda_grid) {
  beta_hat2 <- solve(t(X_train2) %*% X_train2 + lambda * I) %*% t(X_train2) %*% y_train2

  y_hat_test2 <- X_test2 %*% beta_hat2
  
  RMSE_test2 <- sqrt(mean((y_test2 - y_hat_test2)^2))

  # Find the lowest RMSE
  if (RMSE_test2 < min_rmse) {
    min_rmse <- RMSE_test2
    best_lambda <- lambda
  }
}

# Re-fit
beta_hat2 <- solve(t(X_train2) %*% X_train2 + best_lambda * I) %*% t(X_train2) %*% y_train2

plot(log_cnt ~ hour, data = bike_data_train, col = "cornflowerblue", ylim = c(0, 8))

lines(bike_data_test$hour, bike_data_test$log_cnt, type = "p", col = "lightcoral")

hours_grid <- seq(0, 1, length.out = 1000)
X_grid2 <- ns(hours_grid, knots = knots2, intercept = TRUE)

y_hat_spline_grid2 <- X_grid2 %*% beta_hat2
lines(hours_grid, y_hat_spline_grid2, lty = 1, col = "green", lwd = 2)
legend(x = "topleft", pch = c(1, 1, NA), lty = c(NA, NA, 1), col = c("cornflowerblue", "lightcoral", "green"), legend = c("Train", "Test", "Regularized Spline"))

#cat("Best lambda value: ", best_lambda, "\n")
#cat("RMSE for Test data with best lambda: ", min_rmse, "\n")

💪 Problem 2.2

Use the package glmnet to fit a spline regression with an L2 regularisation using the same basis functions as the previous problem. Find the optimal \(\lambda\) by applying the one-standard deviation rule when cross-validating the training data using 10 folds. Compute the RMSE (using the optimal \(\lambda\)) for the training and test data. Plot the fit together with the training and test data.

Tip

The help() function will be very useful here. Another useful function is cv.glmnet(), where cv stands for cross-validation. The argument alpha (to the cv.glmnet() function) is key. Finally, the predict() function can be used on the object returned by cv.glmnet() and the argument s (to the predict() function) controls which \(\lambda\) to use.

library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-8
# Set up the alpha value for L2
alpha <- 0

lambda_seq <- 10^seq(0, -2, by = -0.1)

# Cross-validation
cv_fit <- cv.glmnet(x = X_train2, y = y_train2, alpha = alpha, lambda = lambda_seq, nfolds = 10)

optimal_lambda <- cv_fit$lambda.1se # one-standard-deviation rule

# Fit the final model
final_model <- glmnet(x = X_train2, y = y_train2, alpha = alpha, lambda = optimal_lambda)

y_hat_train <- predict(final_model, s = optimal_lambda, newx = X_train2)
y_hat_test <- predict(final_model, s = optimal_lambda, newx = X_test2)

rmse_train <- sqrt(mean((y_train2 - y_hat_train)^2))
rmse_test <- sqrt(mean((y_test2 - y_hat_test)^2))

plot(log_cnt ~ hour, data = bike_data_train, col = "cornflowerblue", ylim = c(0, 8))
lines(bike_data_test$hour, bike_data_test$log_cnt, type = "p", col = "lightcoral")

hours_grid <- seq(0, 1, length.out = 1000)
X_grid <- ns(hours_grid, knots = knots2, intercept = TRUE)

y_hat_spline_grid <- predict(final_model, s = optimal_lambda, newx = X_grid)
lines(hours_grid, y_hat_spline_grid, lty = 1, col = "green", lwd = 2)
legend(x = "topleft", pch = c(1, 1, NA), lty = c(NA, NA, 1), col = c("cornflowerblue", "lightcoral", "green"), legend = c("Train", "Test", "Regularized Spline"))

# Print the optimal lambda and RMSE for training and test data
cat("Optimal lambda using the one-standard deviation: ", optimal_lambda, "\n")
Optimal lambda using the one-standard deviation:  0.4379179 
cat("RMSE for Training data with optimal lambda: ", rmse_train, "\n")
RMSE for Training data with optimal lambda:  0.7112003 
cat("RMSE for Test data with optimal lambda: ", rmse_test, "\n")
RMSE for Test data with optimal lambda:  1.00077 

💪 Problem 2.3

Repeat Problem 2.2, however, use the optimal \(\lambda\) that minimises the mean cross-validated error. Compare the RMSE for the training and test data to those in Problem 2.2.

Tip

The help() function will be very useful here.

alpha <- 0
lambda_seq <- 10^seq(0, -2, by = -0.1)

cv_fit <- cv.glmnet(x = X_train2, y = y_train2, alpha = alpha, lambda = lambda_seq, nfolds = 10)

# Mean cv
optimal_lambda <- cv_fit$lambda.min

final_model <- glmnet(x = X_train2, y = y_train2, alpha = alpha, lambda = optimal_lambda)

y_hat_train <- predict(final_model, s = optimal_lambda, newx = X_train2)
y_hat_test <- predict(final_model, s = optimal_lambda, newx = X_test2)

rmse_train_optimal <- sqrt(mean((y_train2 - y_hat_train)^2))
rmse_test_optimal <- sqrt(mean((y_test2 - y_hat_test)^2))

cat("RMSE for Training data using mean cross-validation error (2.3): ", rmse_train_optimal, "\n")
RMSE for Training data using mean cross-validation error (2.3):  0.6913383 
cat("RMSE for Test data using mean cross-validation error (2.3): ", rmse_test_optimal, "\n")
RMSE for Test data using mean cross-validation error (2.3):  1.004461 
cat("RMSE for Training data using one-standard deviation (2.2): ", rmse_train, "\n")
RMSE for Training data using one-standard deviation (2.2):  0.7112003 
cat("RMSE for Test data using one-standard deviation (2.2): ", rmse_test, "\n")
RMSE for Test data using one-standard deviation (2.2):  1.00077 
The two choices give very similar RMSEs: the \(\lambda\) minimising the mean cross-validated error performs slightly better on the training data and slightly worse on the test data than the one chosen by the one-standard-deviation rule.

💪 Problem 2.4

Repeat Problem 2.2 using L1 regularisation. Compare the RMSE for the training and test data to those in Problem 2.2.

Tip

The help() function will be very useful here to set the argument alpha (to the cv.glmnet() function).

# Set up alpha value for L1 (lasso)
alpha <- 1
lambda_seq <- 10^seq(0, -2, by = -0.1)

cv_fit <- cv.glmnet(x = X_train2, y = y_train2, alpha = alpha, lambda = lambda_seq, nfolds = 10)

# As in 2.2, use the one-standard-deviation rule
optimal_lambda <- cv_fit$lambda.1se

final_model <- glmnet(x = X_train2, y = y_train2, alpha = alpha, lambda = optimal_lambda)

y_hat_train <- predict(final_model, s = optimal_lambda, newx = X_train2)
y_hat_test <- predict(final_model, s = optimal_lambda, newx = X_test2)

rmse_train_lasso <- sqrt(mean((y_train2 - y_hat_train)^2))
rmse_test_lasso <- sqrt(mean((y_test2 - y_hat_test)^2))

cat("RMSE for Training data with L1 regularization: ", rmse_train_lasso, "\n")
RMSE for Training data with L1 regularization:  0.7144238 
cat("RMSE for Test data with L1 regularization: ", rmse_test_lasso, "\n")
RMSE for Test data with L1 regularization:  1.00522 
cat("RMSE for Training data in Problem 2.2: ", rmse_train, "\n")
RMSE for Training data in Problem 2.2:  0.7112003 
cat("RMSE for Test data in Problem 2.2: ", rmse_test, "\n")
RMSE for Test data in Problem 2.2:  1.00077 
The L2-regularised model from Problem 2.2 achieves slightly lower RMSE on both the training and test data than the L1 (lasso) model, so L2 regularisation performs better here.

3. Regularised regression for bike rental data with more features and data (2 marks)

We now consider the full dataset and use many more features. We use the first one and a half years (Jan 1, 2011 - May 31, 2012) of data for training and the last half year (June 1, 2012- Dec 31, 2012) for testing.

We will use the following categorical features:

  • weathersit: Clear (1), cloudy (2), light rain (3), heavy rain (4).

  • weekday: Sun (0), Mon (1), Tue (2), Wed (3), Thu (4), Fri (5), Sat (6).

  • season: Winter (1), spring (2), summer (3), fall (4).

We will use the common approach in statistics that creates \(K-1\) dummy variables for a categorical variable with \(K\) levels. The first level is the reference category. The following code constructs these so called one-hot encodings and adds them to the dataset bike_data.

# One hot for weathersit
one_hot_encode_weathersit <- model.matrix(~ as.factor(weathersit) - 1,data = bike_data)
one_hot_encode_weathersit  <- one_hot_encode_weathersit[, -1] # Remove reference category
colnames(one_hot_encode_weathersit) <- c('cloudy', 'light rain', 'heavy rain')
bike_data <- cbind(bike_data, one_hot_encode_weathersit)

# One hot for weekday
one_hot_encode_weekday <- model.matrix(~ as.factor(weekday) - 1,data = bike_data)
one_hot_encode_weekday  <- one_hot_encode_weekday[, -1] # Remove reference category
colnames(one_hot_encode_weekday) <- c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
bike_data <- cbind(bike_data, one_hot_encode_weekday)

# One hot for season
one_hot_encode_season <- model.matrix(~ as.factor(season) - 1,data = bike_data)
one_hot_encode_season  <- one_hot_encode_season[, -1] # Remove reference category
colnames(one_hot_encode_season) <- c('Spring', 'Summer', 'Fall')
bike_data <- cbind(bike_data, one_hot_encode_season)

💪 Problem 3.1

Fit a spline regression to the training data with an L1 regularisation. Find the optimal \(\lambda\) by applying the one-standard deviation rule when cross-validating the training data using 10 folds. Compute the RMSE (using the optimal \(\lambda\)) for the training and test data. For the spline, use the splines package in R to create natural cubic splines basis functions of the variable hour with 10 degrees of freedom, i.e. use the ns() function with df=10 as input argument. Moreover, set intercept=FALSE and add it manually to your design matrix instead.

The following variables should be included in the regression:

  • Response variable: log_cnt.

  • Features: hour (via the cubic spline described above), yr, holiday, workingday, temp, atemp, hum, windspeed, and all the dummy variables created above.

Tip

First create the design matrix of interest (you have to exclude the original variables you did the one-hot coding for). When creating the spline basis for the test data, you will need to use the same knots as when constructing the basis functions for the training dataset. If the spline is stored in a variable spline_basis, you can extract the knots by using knots<-attr(spline_basis, "knots") and then use the knots argument when constructing the design matrix for the test data, e.g. ns(bike_all_data_test$hour, df=10, knots=knots, intercept = FALSE).

# Split train and test
bike_data_train <- bike_data[bike_data$dteday >= as.Date("2011-01-01") & bike_data$dteday <= as.Date("2012-05-31"), ]
bike_data_test <- bike_data[bike_data$dteday >= as.Date("2012-06-01") & bike_data$dteday <= as.Date("2012-12-31"), ]

spline_basis_train <- ns(bike_data_train$hour, df=10, intercept = FALSE)
knots <- attr(spline_basis_train, "knots")

# Construct design matrix
design_matrix_train <- cbind(
  as.matrix(spline_basis_train),
  as.matrix(bike_data_train[, c("yr", "holiday", "workingday", "temp", "atemp", "hum", "windspeed")]),
  as.matrix(bike_data_train[, c("cloudy", "light rain", "heavy rain", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Spring", "Summer", "Fall")])
)
response_train <- bike_data_train$log_cnt

# Fit model using cross-validation
cv_model <- cv.glmnet(
  x = design_matrix_train,
  y = response_train,
  alpha = 1,  # L1
  family = "gaussian",
  nfolds = 10
)

# Extract the optimal lambda using the one-standard-deviation rule
lambda_optimal <- cv_model$lambda.1se

# Test part
spline_basis_test <- ns(bike_data_test$hour, df=10, knots=knots, intercept = FALSE)

design_matrix_test <- cbind(
  as.matrix(spline_basis_test),
  as.matrix(bike_data_test[, c("yr", "holiday", "workingday", "temp", "atemp", "hum", "windspeed")]),
  as.matrix(bike_data_test[, c("cloudy", "light rain", "heavy rain", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Spring", "Summer", "Fall")])
)
response_test <- bike_data_test$log_cnt

# Fit the model with the optimal lambda
model <- glmnet(
  x = design_matrix_train,
  y = response_train,
  alpha = 1,
  lambda = lambda_optimal
)

pred_train <- predict(model, newx = design_matrix_train)
rmse_train <- sqrt(mean((response_train - pred_train)^2))

pred_test <- predict(model, newx = design_matrix_test)
rmse_test <- sqrt(mean((response_test - pred_test)^2))

cat("Optimal Lambda: ", lambda_optimal,"\n")
Optimal Lambda:  0.001217784 
cat("Training RMSE: ", rmse_train,"\n")
Training RMSE:  0.6431978 
cat("Test RMSE: ", rmse_test,"\n")
Test RMSE:  0.6253389 

💪 Problem 3.2

Which three features in Problem 3.1 seem to be the most important?

Tip

To explore which variables are most important, you can re-estimate the model in Problem 3.1 without cross-validation (using glmnet()). Note that you can apply the plot() function to an object returned by glmnet() using the arguments xvar="lambda" or xvar="norm", and label=TRUE, which allows you to visualise the lasso path.

model_full <- glmnet(
  x = design_matrix_train,
  y = response_train,
  alpha = 1 
)

coef_matrix <- as.matrix(coef(model_full))

feature_names <- rownames(coef_matrix)

plot(model_full, xvar = "lambda", label = TRUE, main = "Lasso Path with Feature Names")

# Rank features by the absolute size of their coefficients at the optimal
# lambda from Problem 3.1. glmnet standardises internally but returns
# coefficients on the original scale, so this ranking is a rough guide
# rather than a formal importance measure.
coef_opt <- as.matrix(coef(model_full, s = lambda_optimal))
importance <- abs(coef_opt[-1, 1]) # drop the intercept
top_features <- names(sort(importance, decreasing = TRUE))[1:3]
print(top_features)

💪 Problem 3.3

Carry out a residual analysis in Problem 3.1 for the training data. What can you say about the assumption of independent residuals? Repeat the same for the test data.

Tip

Plot the autocorrelation function of the residuals to validate the independence assumption.

model_full <- glmnet(
  x = design_matrix_train,
  y = response_train,
  alpha = 1, 
  lambda = lambda_optimal
)

pred_train <- predict(model_full, newx = design_matrix_train, s = lambda_optimal)

# Calculate residuals
residuals_train <- response_train - pred_train

# Flatten residuals to a vector
residuals_train <- as.vector(residuals_train)

acf(residuals_train, main = "ACF of Residuals (Training Data)")

The residuals show significant autocorrelation at multiple lags, with a clear cyclical pattern, so they are not independent. The assumption of independent residuals is therefore violated for the training data.
pred_test <- predict(model_full, newx = design_matrix_test, s = lambda_optimal)

residuals_test <- response_test - pred_test

residuals_test <- as.vector(residuals_test)

acf(residuals_test, main = "ACF of Residuals (Test Data)")

The test residuals show the same pattern: significant, cyclical autocorrelation at multiple lags. The assumption of independent residuals is therefore violated for the test data as well.

4. Regularised time series regression for bike rental data (2 marks)

In this section we will consider the time series nature of the data.

💪 Problem 4.1

Plot a time series plot of the response in the original scale (i.e. counts and not log-counts) for the last week of the test data (last \(24\times 7\) observations). In the same figure, plot a time series plot of the fitted values (in the original scale) from Problem 3.1. Comment on the fit.

last_week_indices <- (nrow(bike_data_test) - 24*7 + 1):nrow(bike_data_test)
last_week_actual <- bike_data_test$cnt[last_week_indices]

last_week_fitted <- exp(pred_test[last_week_indices])

plot(last_week_actual, type = "l", col = "blue", lwd = 2, ylab = "Counts", xlab = "Time", main = "Actual vs Fitted Counts for the Last Week")
lines(last_week_fitted, col = "red", lwd = 2, lty = 2)
legend("topright", legend = c("Actual Counts", "Fitted Counts"), col = c("blue", "red"), lwd = 2, lty = c(1, 2))

According to the output, the fitted values follow almost the same daily structure as the actual counts, but the fit is clearly not accurate.

💪 Problem 4.2

Add time series effects to your model by including some lagged values of the response as features. Use the first four hourly lags of log_cnt plus the 24th hourly lag as features, in addition to all other features in Problem 3.1. Fit the model using an L1 regularisation and find the optimal \(\lambda\) by applying the one-standard deviation rule when cross-validating the training data using 10 folds. Compute the RMSE (using the optimal \(\lambda\)) for the training and test data and compare to Problem 3.1. Are the residuals from this new model more adequate?

Tip

The function mutate() from the dplyr package is useful for adding columns to a data frame. Moreover, the function lag(x, n) from dplyr lags the time series x by n steps.

Notice that some observations are lost when lagging, i.e. NA values are introduced. You can remove these observations from the dataset.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
bike_data <- bike_data %>%
  mutate(lag1 = lag(log_cnt, 1),
         lag2 = lag(log_cnt, 2),
         lag3 = lag(log_cnt, 3),
         lag4 = lag(log_cnt, 4),
         lag24 = lag(log_cnt, 24))

bike_data <- na.omit(bike_data)

bike_data_train <- bike_data[bike_data$dteday >= as.Date("2011-01-01") & bike_data$dteday <=  as.Date("2012-05-31"), ]
bike_data_test <- bike_data[bike_data$dteday >= as.Date("2012-06-01") & bike_data$dteday <=  as.Date("2012-12-31"), ]

y_train <- bike_data_train$log_cnt
y_test <- bike_data_test$log_cnt

spline_train <- ns(bike_data_train$hour, df = 10, intercept = FALSE)
# Reuse both the interior and the boundary knots from the training basis
spline_test <- ns(bike_data_test$hour, knots = attr(spline_train, "knots"),
                  Boundary.knots = attr(spline_train, "Boundary.knots"), intercept = FALSE)

# Prepare the design matrices
X_train <- cbind(spline_train,
                 bike_data_train[, c('yr','holiday','workingday','temp','atemp','hum','windspeed', 'cloudy', 'light rain', 'heavy rain', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Spring', 'Summer', 'Fall', 'lag1', 'lag2', 'lag3', 'lag4', 'lag24')])

X_test <- cbind(spline_test,
                bike_data_test[, c('yr','holiday','workingday','temp','atemp','hum','windspeed', 'cloudy', 'light rain', 'heavy rain', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Spring', 'Summer', 'Fall', 'lag1', 'lag2', 'lag3', 'lag4', 'lag24')])

X_train <- as.matrix(X_train)
X_test <- as.matrix(X_test)

# Fit Lasso using cv
cv_fit_lasso <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 10)
opt_lambda_lasso <- cv_fit_lasso$lambda.1se

y_hat_train_lasso42 <- predict(cv_fit_lasso, newx = X_train, s = opt_lambda_lasso)
y_hat_test_lasso42 <- predict(cv_fit_lasso, newx = X_test, s = opt_lambda_lasso)

RMSE_train_lasso <- sqrt(mean((y_train - y_hat_train_lasso42)^2))
RMSE_test_lasso <- sqrt(mean((y_test - y_hat_test_lasso42)^2))

residuals_train <- y_train - y_hat_train_lasso42
residuals_test <- y_test - y_hat_test_lasso42

par(mfrow = c(2, 1))
plot(residuals_train, main = "Residuals - Training Data", ylab = "Residuals", xlab = "Index", col = "blue", pch = 20)
abline(h = 0, col = "red")
plot(residuals_test, main = "Residuals - Test Data", ylab = "Residuals", xlab = "Index", col = "blue", pch = 20)
abline(h = 0, col = "red")

cat("RMSE training 3.1: ", rmse_train,"\n")
RMSE training 3.1:  0.6431978 
cat("RMSE test 3.1: ", rmse_test,"\n")
RMSE test 3.1:  0.6253389 
cat("RMSE training 4.2: ", RMSE_train_lasso,"\n")
RMSE training 4.2:  0.4257333 
cat("RMSE test 4.2: ", RMSE_test_lasso,"\n")
RMSE test 4.2:  0.3635715 
Both the training and test RMSEs of the new model (4.2) are clearly smaller than those of the model in 3.1, so the new model is more adequate.

💪 Problem 4.3

Add the predictions from Problem 4.2 to the figure you created in Problem 4.1. Did the predictions improve by adding lags of the response variable?

# The three series needed for the plot (all on the original count scale)
counts <- exp(y_test[last_week_indices])
counts_4_1 <- last_week_fitted
counts_4_2 <- exp(y_hat_test_lasso42[last_week_indices])

time_points <- 1:(24*7)
plot(time_points, counts, type = "l", col = "blue", lty = 1, lwd = 2, ylab = "Counts", xlab = "Time")

lines(time_points, counts_4_1, col = "red", lty = 1, lwd = 2)
lines(time_points, counts_4_2, col = "green", lty = 1, lwd = 2)
legend("topleft", legend = c("Actual Counts", "4.1 Counts", "4.2 Counts"), col = c("blue", "red", "green"), lty = c(1, 1), lwd = 2)

According to the plot, the predictions from 4.2 (green) follow the actual counts more closely than those from 4.1 (red). So by adding lags of the response variable, we can improve our predictions.

5. Regression trees for bike rental data (2 marks)

Another alternative to the semi-parametric approach above is to fit a regression tree.

💪 Problem 5.1

Using the training dataset from Problem 4.2 (that also includes lagged values of the response as features), fit a regression tree using the tree package. Experiment with the settings to see how changing them affects the results.

library(tree)

# Combine X_train and y_train into a data frame
train_data <- data.frame(X_train, log_cnt = y_train)

# Fit the regression tree model
tree_model <- tree(log_cnt ~ ., data = train_data)

summary(tree_model)

Regression tree:
tree(formula = log_cnt ~ ., data = train_data)
Variables actually used in tree construction:
[1] "lag1"  "X10"   "X8"    "lag4"  "lag24"
Number of terminal nodes:  9 
Residual mean deviance:  0.3258 = 3991 / 12250 
Distribution of residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.51400 -0.34300  0.03172  0.00000  0.38000  2.33700 
pred_train_tree <- predict(tree_model, newdata = train_data)
rmse_train_tree <- sqrt(mean((y_train - pred_train_tree)^2))

# Test part
test_data <- data.frame(X_test)

pred_test_tree <- predict(tree_model, newdata = test_data)
rmse_test_tree <- sqrt(mean((y_test - pred_test_tree)^2))

cat("Training RMSE: ", rmse_train_tree, "\n")
Training RMSE:  0.5706015 
cat("Test RMSE: ", rmse_test_tree, "\n")
Test RMSE:  0.6000504 

💪 Problem 5.2

Plot the tree structure in Problem 5.1.

#plot(tree_model)
#text(tree_model, pretty = 0)

# Better plot
if (!require(rpart)) install.packages("rpart")
Loading required package: rpart
if (!require(rpart.plot)) install.packages("rpart.plot")
Loading required package: rpart.plot
library(rpart)
library(rpart.plot)

# Refit the same tree with rpart, on the Problem 5.1 training design, for a nicer plot
rpart_fit <- rpart(log_cnt ~ ., data = train_data, method = "anova")

rpart.plot(rpart_fit, main = "Regression Tree for Bike Rental Data")

💪 Problem 5.3

Add the predictions from Problem 5.1 to the figure you created in Problem 4.3. Comment on the fit of the regression tree compared to that of the semi-parametric spline approach.

# Last week of the test data (the last 24*7 observations, as in Problem 4.1)
last_week_indices <- (nrow(bike_data_test) - 24*7 + 1):nrow(bike_data_test)

counts_tree <- exp(pred_test_tree[last_week_indices])

time_points <- 1:(24*7)
plot(time_points, counts, type = "l", col = "blue", lty = 1, lwd = 2, ylab = "Counts", xlab = "Time")

lines(time_points, counts_4_1, col = "red", lty = 1, lwd = 2)
lines(time_points, counts_4_2, col = "green", lty = 1, lwd = 2)

lines(time_points, counts_tree, col = "purple", lty = 1, lwd = 2)

legend("topleft", legend = c("Actual Counts", "4.1 Counts", "4.2 Counts", "Tree Counts"), col = c("blue", "red", "green", "purple"), lty = c(1, 1, 1, 1), lwd = 2)

We can see that the predictions of the 5.1 tree model (purple) are piecewise constant and not very precise, which is characteristic of a single regression tree. Still, the tree is sensitive to the peaks and valleys.

6. Logistic regression for classifying spam emails (2 marks)

The dataset spam_ham_emails.RData consists of \(n=4601\) spam (\(y=1\)) and ham (no spam, \(y=0\)) emails, each described by 15 features2.

Most of the features are continuous real variables in the interval \([0, 100]\), with values corresponding to the percentage of occurrence of a specific word or character in the email. There are also a few features that capture the tendency to use many capital letters. The features are on different scales, so we will standardise them to have zero mean and unit variance. The following code reads in the data (note that you have to change the path to where you stored the file!), standardises the features, and codes the response variable as a factor.
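A quick sanity check on a toy vector (hypothetical values) that scale() really produces zero mean and unit standard deviation:

```r
x <- c(1, 2, 3, 10)
z <- scale(x)  # subtract the mean, divide by the sample standard deviation
c(mean = mean(z), sd = sd(z))  # mean 0 and sd 1, up to floating point
```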

rm(list=ls()) # Remove all variables from the workspace
cat("\014") # Clear the console
load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/1/spam_ham_emails.RData')

Spam_ham_emails[, -1] <- scale(Spam_ham_emails[, -1])
Spam_ham_emails['spam'] <- as.factor(Spam_ham_emails['spam'] == 1) # Changing from 1->TRUE, 0->FALSE

levels(Spam_ham_emails$spam) <- c("not spam", "spam")
head(Spam_ham_emails)
      spam       over     remove   internet        free        hpl          X.
1 not spam -0.3502281 -0.2917622 -0.2625330 -0.30134485  4.2669752 -0.32987644
2     spam  1.6583607 -0.2917622  0.4106637  1.03071019 -0.2992074  0.38364598
3 not spam -0.3502281 -0.2917622 -0.2625330 -0.30134485  2.6659926 -0.32987644
4     spam  0.6358064 -0.2917622 -0.2625330 -0.30134485 -0.2992074 -0.28451505
5     spam  0.3436480  0.1936234 -0.2625330 -0.07126262 -0.2992074 -0.06996793
6     spam -0.3502281  0.9855684  0.9841276 -0.30134485 -0.2992074  0.17522878
        X..1   CapRunMax CapRunTotal        our         hp   george      X1999
1 -0.3083214 -0.17021174  0.02096274 -0.4642639  3.9791176 -0.22787  4.9900587
2  0.8751730 -0.08811470  0.01271665 -0.4642639 -0.3287789 -0.22787 -0.3234205
3 -0.3083214 -0.22665345 -0.40618481 -0.4642639 -0.3287789 -0.22787  2.7702052
4 -0.3083214 -0.09837683 -0.12416847 -0.4642639 -0.3287789 -0.22787 -0.3234205
5  0.9239769  0.40959862  0.75651413 -0.1817414 -0.3287789 -0.22787 -0.3234205
6  1.3672790 -0.17021174 -0.05655052  0.6658261 -0.3287789 -0.22787 -0.3234205
           re       edu
1  0.59185916 -0.197366
2 -0.03086294 -0.197366
3 -0.29774385 -0.197366
4 -0.29774385 -0.197366
5 -0.29774385 -0.197366
6 -0.29774385 -0.197366
str(Spam_ham_emails)
'data.frame':   4601 obs. of  16 variables:
 $ spam       : Factor w/ 2 levels "not spam","spam": 1 2 1 2 2 2 2 1 2 1 ...
 $ over       : num  -0.35 1.658 -0.35 0.636 0.344 ...
 $ remove     : num  -0.292 -0.292 -0.292 -0.292 0.194 ...
 $ internet   : num  -0.263 0.411 -0.263 -0.263 -0.263 ...
 $ free       : num  -0.3013 1.0307 -0.3013 -0.3013 -0.0713 ...
 $ hpl        : num  4.267 -0.299 2.666 -0.299 -0.299 ...
 $ X.         : num  -0.33 0.384 -0.33 -0.285 -0.07 ...
 $ X..1       : num  -0.308 0.875 -0.308 -0.308 0.924 ...
 $ CapRunMax  : num  -0.1702 -0.0881 -0.2267 -0.0984 0.4096 ...
 $ CapRunTotal: num  0.021 0.0127 -0.4062 -0.1242 0.7565 ...
 $ our        : num  -0.464 -0.464 -0.464 -0.464 -0.182 ...
 $ hp         : num  3.979 -0.329 -0.329 -0.329 -0.329 ...
 $ george     : num  -0.228 -0.228 -0.228 -0.228 -0.228 ...
 $ X1999      : num  4.99 -0.323 2.77 -0.323 -0.323 ...
 $ re         : num  0.5919 -0.0309 -0.2977 -0.2977 -0.2977 ...
 $ edu        : num  -0.197 -0.197 -0.197 -0.197 -0.197 ...

The response is the first column spam and we will use all the other features to predict if an email is a spam or ham. Let us first inspect the fraction of spam emails in the dataset.

cat("Percentage of spam:", 100*mean(Spam_ham_emails$spam == "spam"))
Percentage of spam: 39.40448

We now split the data using the createDataPartition() function in the caret package. The default method in caret is stratified sampling, which keeps the same fraction of spam and ham in both the training and test datasets. We use 75% of the data for training and 25% for testing.

set.seed(1234)
suppressMessages(library(caret))
train_obs <- createDataPartition(y = Spam_ham_emails$spam, p = .75, list = FALSE)

train <- Spam_ham_emails[train_obs, ]
test <- Spam_ham_emails[-train_obs, ]

# Confirm both training and test are balanced with respect to spam emails
cat("Percentage of training data consisting of spam emails:", 100*mean(train$spam == "spam"))
Percentage of training data consisting of spam emails: 39.40887
cat("Percentage of test data consisting of spam emails:", 100*mean(test$spam == "spam"))
Percentage of test data consisting of spam emails: 39.3913

The following code fits a logistic regression on the training data using the glm() function, predicts the test data following the rule if \(\mathrm{Pr}(y=1|\mathbf{x})>0.5 \Rightarrow y=1\), and computes the confusion matrix using the caret package.

glm_fit <- glm(spam ~ ., family = binomial, data = train)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
y_prob_hat_test <- predict(glm_fit, newdata = test, type = "response")
threshold <- 0.5 # Predict spam if probability > threshold
y_hat_test <- as.factor(y_prob_hat_test > threshold)
levels(y_hat_test) <- c("not spam", "spam")
confusionMatrix(data = y_hat_test, test$spam, positive = "spam")
Confusion Matrix and Statistics

          Reference
Prediction not spam spam
  not spam      658   65
  spam           39  388
                                          
               Accuracy : 0.9096          
                 95% CI : (0.8915, 0.9255)
    No Information Rate : 0.6061          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.8087          
                                          
 Mcnemar's Test P-Value : 0.01423         
                                          
            Sensitivity : 0.8565          
            Specificity : 0.9440          
         Pos Pred Value : 0.9087          
         Neg Pred Value : 0.9101          
             Prevalence : 0.3939          
         Detection Rate : 0.3374          
   Detection Prevalence : 0.3713          
      Balanced Accuracy : 0.9003          
                                          
       'Positive' Class : spam            
                                          

💪 Problem 6.1

Reconstruct the confusion matrix for the test data without using the confusionMatrix() function.

# Predictions
y_prob_hat_test <- predict(glm_fit, newdata = test, type = "response")
threshold <- 0.5
y_hat_test <- as.factor(y_prob_hat_test > threshold)
levels(y_hat_test) <- c("not spam", "spam")

# Create matrix
true_labels <- test$spam

# True Positives: Predicted spam and it's actually spam
TP <- sum(y_hat_test == "spam" & true_labels == "spam")

# False Positives: Predicted spam but it's actually not spam
FP <- sum(y_hat_test == "spam" & true_labels == "not spam")

# True Negatives: Predicted not spam and it's actually not spam
TN <- sum(y_hat_test == "not spam" & true_labels == "not spam")

# False Negatives: Predicted not spam but it's actually spam
FN <- sum(y_hat_test == "not spam" & true_labels == "spam")

# Create the confusion matrix
confusion_matrix <- matrix(c(TN, FP, FN, TP), nrow = 2, byrow = TRUE, dimnames = list("Actual" = c("not spam", "spam"), "Predicted" = c("not spam", "spam")))

print(confusion_matrix)
          Predicted
Actual     not spam spam
  not spam      658   39
  spam           65  388

💪 Problem 6.2

Compute the accuracy, precision, sensitivity (recall), and specificity without using the confusionMatrix() function. Explain these four concepts in the context of the spam filter.

# Calculate metrics from the confusion matrix
accuracy <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
sensitivity <- TP / (TP + FN) 
specificity <- TN / (TN + FP)

cat("Accuracy:", accuracy, "\n")
Accuracy: 0.9095652 
cat("Precision:", precision, "\n")
Precision: 0.9086651 
cat("Sensitivity (Recall):", sensitivity, "\n")
Sensitivity (Recall): 0.8565121 
cat("Specificity:", specificity, "\n")
Specificity: 0.9440459 

In the context of the spam filter: accuracy is the overall fraction of emails classified correctly; precision is the fraction of emails flagged as spam that really are spam (high precision means few legitimate emails end up in the spam folder); sensitivity (recall) is the fraction of actual spam emails that the filter catches; and specificity is the fraction of legitimate (ham) emails that are correctly let through.

💪 Problem 6.3

Compute the ROC curve using the pROC() package. Explain in detail what the ROC curve shows.

library(pROC)
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'
The following objects are masked from 'package:stats':

    cov, smooth, var
roc_curve <- roc(response = test$spam, predictor = y_prob_hat_test)
Setting levels: control = not spam, case = spam
Setting direction: controls < cases
plot(roc_curve, col = "blue", main = "ROC Curve")

auc_value <- auc(roc_curve)
cat("Area Under the Curve (AUC):", auc_value)
Area Under the Curve (AUC): 0.9650901
The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) as the classification threshold is varied from 0 to 1, summarising the trade-off between catching spam and wrongly flagging ham at every possible threshold. Our curve bows strongly towards the top-left corner and lies well above the diagonal line (which corresponds to random guessing), and the AUC of about 0.965 confirms that the classifier separates spam from ham very well.
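To make the construction concrete: each point on the ROC curve is the (FPR, TPR) pair obtained by thresholding the predicted probabilities at one value. A self-contained sketch with made-up scores and labels (all numbers hypothetical):

```r
# Hypothetical predicted probabilities and true labels (1 = spam)
scores <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
labels <- c(1, 1, 0, 1, 0, 0)

# One ROC point for a given threshold
roc_point <- function(thr) {
  pred <- as.integer(scores > thr)
  c(FPR = sum(pred == 1 & labels == 0) / sum(labels == 0),
    TPR = sum(pred == 1 & labels == 1) / sum(labels == 1))
}

roc_point(0.5)  # FPR 1/3, TPR 2/3
roc_point(0.2)  # FPR 2/3, TPR 1
```

Sweeping the threshold from 1 down to 0 traces the full curve from (0, 0) to (1, 1); pROC automates exactly this.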

7. Decision trees for classifying spam emails (1 mark)

💪 Problem 7.1

Using the training dataset from Problem 6, fit a decision tree (classification tree) using the tree package with the default settings. How does this classifier perform on the test dataset compared to the logistic classifier in Problem 6?

library(tree)

tree_fit <- tree(spam ~ ., data = train)

summary(tree_fit)

Classification tree:
tree(formula = spam ~ ., data = train)
Variables actually used in tree construction:
[1] "X."          "remove"      "X..1"        "george"      "hp"         
[6] "CapRunMax"   "our"         "free"        "CapRunTotal"
Number of terminal nodes:  14 
Residual mean deviance:  0.4606 = 1583 / 3437 
Misclassification error rate: 0.08056 = 278 / 3451 
y_hat_test_tree <- predict(tree_fit, newdata = test, type = "class")

# Convert predictions to factor with levels "not spam" and "spam"
y_hat_test_tree <- as.factor(y_hat_test_tree)
levels(y_hat_test_tree) <- c("not spam", "spam")

true_labels <- test$spam

TP_tree <- sum(y_hat_test_tree == "spam" & true_labels == "spam")
FP_tree <- sum(y_hat_test_tree == "spam" & true_labels == "not spam")
TN_tree <- sum(y_hat_test_tree == "not spam" & true_labels == "not spam")
FN_tree <- sum(y_hat_test_tree == "not spam" & true_labels == "spam")

# Create the confusion matrix
confusion_matrix_tree <- matrix(c(TN_tree, FP_tree, FN_tree, TP_tree), nrow = 2, byrow = TRUE,dimnames = list("Actual" = c("not spam", "spam"),"Predicted" = c("not spam", "spam")))

print(confusion_matrix_tree)
          Predicted
Actual     not spam spam
  not spam      652   45
  spam           66  387
accuracy_tree <- (TP_tree + TN_tree) / (TP_tree + TN_tree + FP_tree + FN_tree)
precision_tree <- TP_tree / (TP_tree + FP_tree)
sensitivity_tree <- TP_tree / (TP_tree + FN_tree) 
specificity_tree <- TN_tree / (TN_tree + FP_tree)

cat("Decision Tree Metrics:\n")
Decision Tree Metrics:
cat("Accuracy:", accuracy_tree, "\n")
Accuracy: 0.9034783 
cat("Precision:", precision_tree, "\n")
Precision: 0.8958333 
cat("Sensitivity (Recall):", sensitivity_tree, "\n")
Sensitivity (Recall): 0.8543046 
cat("Specificity:", specificity_tree, "\n")
Specificity: 0.9354376 

💪 Problem 7.2

Plot the tree structure in Problem 7.1.

library(tree)

tree_fit <- tree(spam ~ ., data = train)

#plot(tree_fit)
#text(tree_fit, pretty = 0)

if (!require(rpart)) install.packages("rpart")
if (!require(rpart.plot)) install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

rpart_fit <- rpart(spam ~ ., data = train, method = "class")

rpart.plot(rpart_fit, main = "Decision Tree for Spam Classification")

8. k-nearest neighbour for classifying spam emails (1 mark)

💪 Problem 8.1

Without using any package, fit a k-nearest neighbour classifier to the training dataset from Problem 6. Choose the value of \(k\) that minimises some suitable error function for the test data.

Tip

Note that the RMSE is not a suitable error function because the labels are binary. The misclassification rate is an alternative you can use. To choose the optimal value of \(k\), compute the misclassification rate on the test data for several candidate values of \(k\) and choose the one that minimises it.

To always ensure a majority vote, consider only odd values of \(k\).
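The majority vote among the k neighbour labels can be taken with table(); a toy example (hypothetical labels) showing the mechanism, and why an odd k avoids ties:

```r
# Labels of the k = 3 nearest neighbours (hypothetical)
neighbor_labels <- c("spam", "spam", "not spam")

# Majority vote: the most frequent label among the neighbours
vote <- names(sort(table(neighbor_labels), decreasing = TRUE))[1]
vote  # "spam"

# With an even k, a 2-2 tie is possible, so the winner would depend on
# an arbitrary tie-break rather than on the data
```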

# Function to calculate Euclidean distance
euclidean_distance <- function(x1, x2) {
  sqrt(sum((x1 - x2)^2))
}

# Function to classify a single test point using k-NN
knn_classify <- function(test_point, train_data, train_labels, k) {
  distances <- apply(train_data, 1, function(train_point) euclidean_distance(test_point, train_point))

  neighbors <- order(distances)[1:k]
  
  neighbor_labels <- train_labels[neighbors]
  
  return(names(sort(table(neighbor_labels), decreasing = TRUE))[1])
}

# Function to perform k-NN classification on the test set
knn_predict <- function(test_data, train_data, train_labels, k) {
  apply(test_data, 1, function(test_point) knn_classify(test_point, train_data, train_labels, k))
}

load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/1/spam_ham_emails.RData')

# Standardize the features
Spam_ham_emails[, -1] <- scale(Spam_ham_emails[, -1])
Spam_ham_emails['spam'] <- as.factor(Spam_ham_emails['spam'] == 1) # Changing from 1->TRUE, 0->FALSE
levels(Spam_ham_emails$spam) <- c("not spam", "spam")

# Split the data into training and test sets
set.seed(1234)
library(caret)
train_obs <- createDataPartition(y = Spam_ham_emails$spam, p = .75, list = FALSE)
train <- Spam_ham_emails[train_obs, ]
test <- Spam_ham_emails[-train_obs, ]

train_data <- as.matrix(train[, -1])
train_labels <- train$spam
test_data <- as.matrix(test[, -1])
test_labels <- test$spam

# Define possible values for k (only odd values)
k_values <- seq(1, 9, by = 2)  
misclassification_rates <- numeric(length(k_values))

# Compute misclassification rates for each value of k
for (i in seq_along(k_values)) {
  k <- k_values[i]
  predictions <- knn_predict(test_data, train_data, train_labels, k)
  misclassification_rate <- mean(predictions != test_labels)
  misclassification_rates[i] <- misclassification_rate
  cat("k =", k, "Misclassification rate =", misclassification_rate, "\n")
}
k = 1 Misclassification rate = 0.1052174 
k = 3 Misclassification rate = 0.09217391 
k = 5 Misclassification rate = 0.0973913 
k = 7 Misclassification rate = 0.0973913 
k = 9 Misclassification rate = 0.1069565 
optimal_k <- k_values[which.min(misclassification_rates)]
cat("Optimal k:", optimal_k, "\n")
Optimal k: 3 

💪 Problem 8.2

How does the classifier in Problem 8.1 perform on the test dataset compared to the logistic classifier in Problem 6 and the decision tree classifier in Problem 7?

Tip

Since the k-nearest neighbour method only produces a sharp prediction (i.e. 0 or 1 directly, without first predicting a probability from which a classification threshold could be chosen), we cannot readily compute an ROC curve. Thus, obtain sharp predictions from Problems 6 and 7 first and use them for the comparison with the k-nearest neighbour method.

y_prob_hat_test_logistic <- predict(glm_fit, newdata = test, type = "response")
threshold <- 0.5
y_hat_test_logistic <- as.factor(y_prob_hat_test_logistic > threshold)
levels(y_hat_test_logistic) <- c("not spam", "spam")

y_hat_test_tree <- predict(tree_fit, newdata = test, type = "class")
y_hat_test_tree <- as.factor(y_hat_test_tree)
levels(y_hat_test_tree) <- c("not spam", "spam")

predictions_knn <- knn_predict(test_data, train_data, train_labels, optimal_k)
y_hat_test_knn <- as.factor(predictions_knn)
levels(y_hat_test_knn) <- c("not spam", "spam")

# Function to compute metrics from confusion matrix
compute_metrics <- function(predictions, true_labels) {
  TP <- sum(predictions == "spam" & true_labels == "spam")
  FP <- sum(predictions == "spam" & true_labels == "not spam")
  TN <- sum(predictions == "not spam" & true_labels == "not spam")
  FN <- sum(predictions == "not spam" & true_labels == "spam")
  
  accuracy <- (TP + TN) / (TP + TN + FP + FN)
  precision <- TP / (TP + FP)
  sensitivity <- TP / (TP + FN) 
  specificity <- TN / (TN + FP)
  
  return(list(accuracy = accuracy, precision = precision, sensitivity = sensitivity, specificity = specificity))
}

metrics_logistic <- compute_metrics(y_hat_test_logistic, test$spam)
metrics_tree <- compute_metrics(y_hat_test_tree, test$spam)
metrics_knn <- compute_metrics(y_hat_test_knn, test$spam)
# Create a data frame for metrics
metrics_table <- data.frame(
  Metric = rep(c("Accuracy", "Precision", "Sensitivity (Recall)", "Specificity"), times = 3),
  Model = rep(c("Logistic Regression", "Decision Tree", "k-NN"), each = 4),
  Value = c(
    metrics_logistic$accuracy, metrics_logistic$precision, metrics_logistic$sensitivity, metrics_logistic$specificity,
    metrics_tree$accuracy, metrics_tree$precision, metrics_tree$sensitivity, metrics_tree$specificity,
    metrics_knn$accuracy, metrics_knn$precision, metrics_knn$sensitivity, metrics_knn$specificity
  )
)

# Print the table using knitr::kable for a nicely formatted output
if (!require(knitr)) install.packages("knitr")
Loading required package: knitr
library(knitr)

# Print the metrics table
kable(metrics_table, caption = "Comparison of Classifier Metrics", digits = 3)
Comparison of Classifier Metrics

| Metric               | Model               | Value |
|----------------------|---------------------|-------|
| Accuracy             | Logistic Regression | 0.910 |
| Precision            | Logistic Regression | 0.909 |
| Sensitivity (Recall) | Logistic Regression | 0.857 |
| Specificity          | Logistic Regression | 0.944 |
| Accuracy             | Decision Tree       | 0.903 |
| Precision            | Decision Tree       | 0.896 |
| Sensitivity (Recall) | Decision Tree       | 0.854 |
| Specificity          | Decision Tree       | 0.935 |
| Accuracy             | k-NN                | 0.908 |
| Precision            | k-NN                | 0.876 |
| Sensitivity (Recall) | k-NN                | 0.892 |
| Specificity          | k-NN                | 0.918 |
# Bar plot comparing the classifiers on each metric
library(ggplot2)
ggplot(metrics_table, aes(x = Metric, y = Value, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Comparison of Classifier Metrics", x = "Metric", y = "Value") +
  scale_fill_brewer(palette = "Set1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Comparing the three classifiers, logistic regression performs best on accuracy, precision, and specificity, while the k-NN model is slightly better on sensitivity.

Footnotes

  1. The original data come from this source.↩︎

  2. The original data come from this source. The dataset used here includes a subset of the 57 original features.↩︎